Abstract

Let $\Sigma$ be a finite, ordered alphabet, and let $x = x_1 x_2 \ldots x_n \in \Sigma^n$. A
secondary index for $x$ answers alphabet range queries of the form: Given a
range $[a_l; a_r] \subseteq \Sigma$, return the set $I_{[a_l;a_r]} = \{i \mid x_i \in [a_l; a_r]\}$.
Secondary indexes are heavily used in relational databases and scientific data
analysis. It is well known that the obvious solution, storing a dictionary for the
set $\bigcup_i \{x_i\}$ with a position set associated with each character, does not always
give optimal query time. In this paper we give the first theoretically optimal data
structure for the secondary indexing problem. In the I/O model, the amount of data
read when answering a query is within a constant factor of the minimum space
needed to represent $I_{[a_l;a_r]}$, assuming that the size of internal memory is
$(|\Sigma| \lg n)^{\delta}$ blocks, for some constant $\delta > 0$. The space usage of the data
structure is $O(n \lg |\Sigma|)$ bits in the worst case, and we further show how to bound
the size of the data structure in terms of the 0th order entropy of $x$. We show how
to support updates achieving various time-space trade-offs.

We also consider an approximate version of the basic secondary indexing
problem where a query reports a superset of $I_{[a_l;a_r]}$ containing each element
not in $I_{[a_l;a_r]}$ with probability at most $\varepsilon$, where $\varepsilon > 0$ is the false positive
probability. For this problem the amount of data that needs to be read by the
query algorithm is reduced to $O(|I_{[a_l;a_r]}| \lg(1/\varepsilon))$ bits.

1 Introduction



Indexing capability is a vital part of database systems, and hundreds of indexing 
methods exist. In this paper we consider indexes that store a multiset of keys from 
an ordered set $\Sigma$, where each key has some associated data. The goal is to support
range queries finding all keys (and associated data) in a given interval. In databases 
a distinction is made between primary indexes, where the data associated with each 
key is stored in the index itself, and secondary indexes where the index provides 
references to the associated data, which is stored in a way not controlled by the 
index. The distinction is especially important in the I/O model, where the time to 
read the references is usually much smaller than the time to retrieve the associated 
data. This is because the associated data is, in general, unlikely to be located in a 
small number of disk blocks. 

At first glance it would seem that the performance of secondary indexes is not 
too important in the case where the set of returned references is a large fraction of 
all data (in database jargon, where the selectivity is low), since then the time to read 
the associated data will dominate. However, it is common to use several secondary 
indexes in conjunction. For example, in a database of people we may want to find 
all married men of age 33. This can be done by combining information found in 
secondary indexes for the attributes specifying marital status, sex, and age. Only 
associated data matching all three conditions needs to be returned. This means that 
the time spent by the secondary indexes may be dominant, even when retrieving 
the associated data is taken into account. This way of using secondary indexes, 
often referred to as RID intersection, is particularly common in On-Line Analytical 
Processing (OLAP) systems, information retrieval, and scientific data analysis (see 
e.g. [15, 16, 18] for details). 

From a worst-case query time perspective it would seem better to support queries
like the above using a data structure for orthogonal range queries in three
dimensions, e.g., a range tree. This would ensure good query performance in terms of
data size and result size. However, when the number of dimensions is more than a
small constant (say, 3) known range reporting data structures either:

• Use excessive space (e.g., range trees [11, Section 5.3] have space usage that
grows with $(\lg n)^{d-1}$), or

• Have no (provably) good worst-case performance (e.g., kD-trees [11, Section
5.2] have worst-case query time $n^{1-1/d}$).

This is one reason why it is common to perform multi-dimensional range queries 
by intersection of the set of matching points in each dimension, as explained above. 
Another reason is that one-dimensional search structures are usually simpler and 
have lower constant factors than multi-dimensional data structures. 

Finally, using a collection of one-dimensional search structures allows answering
queries that are more general than orthogonal range queries. Examples are
approximate range searches ("find points that are in the range in at least $d_1$ out of
$d$ dimensions") and partial match queries ("find points that match range conditions
in $d_1$ given dimensions, where $d_1 < d$"). For many of these problems, all known
solutions (for high dimensions) are not much better than the brute force solutions
(either represent all answers, or let queries read most of the data).

1.1 Problem definition 

Let $\Sigma$ be a finite, ordered alphabet, and let $x = x_1 x_2 \ldots x_n \in \Sigma^n$. For a set
$C \subseteq \Sigma$ we define $I_C(x) = \{i \mid x_i \in C\}$. When the string $x$ is understood we omit
it. We formalize the secondary indexing problem as follows: A secondary index for $x$
answers alphabet range queries that, given $a_l, a_r \in \Sigma$, return the set
$I_{[a_l;a_r]}(x)$. We let $z = |I_{[a_l;a_r]}|$ and $\sigma = |\Sigma|$. Without loss of generality we
assume $\sigma < n$ (if it is larger, use a dictionary to map to a smaller alphabet). In this
paper we will consider data structures that output the set in compressed format,
using $O\left(\lg\binom{n}{z}\right)$ bits. The size of the data structure can be expressed either in
terms of $n$ and $\sigma$, or more generally in terms of the 0th order entropy of $x$.

In the semi-dynamic version we allow insertions of a new character at the end of
$x$. In the fully dynamic version we allow changing the character in a given position.
We discuss in Section 4 how this is enough to also handle deletions.

We also consider a generalization of the secondary indexing problem. In the
approximate secondary indexing problem with parameter $\varepsilon > 0$ (Section 3) the result
of queries should be a set $I'_{[a_l;a_r]} \supseteq I_{[a_l;a_r]}$ such that for every
$i \notin I_{[a_l;a_r]}$, $\Pr[i \in I'_{[a_l;a_r]}] \le \varepsilon$. The motivation for this problem is that it
may be enough to filter away almost all points in the $d$-dimensional range query
application. If a point is inside the range in $k$ dimensions, the probability that it
will be reported by all $d$ approximate range queries is at most $\varepsilon^{d-k}$. False
positives can be filtered away when accessing the associated data, assuming that
the $d$ keys are stored with the associated data (which is typically the case in
database applications).

1.2 Previous work 

If $\Sigma$ has constant size, an optimal secondary index (up to constant factors) is to
store for every $a \in \Sigma$ a bitmap index for the set $I_{\{a\}}$, i.e., the bit string in
$\{0, 1\}^n$ where there is a 1 in position $i$ if and only if $x_i = a$. A range query can
be answered by simply reading the bitmap of each character in the range.
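
The bitmap index just described can be illustrated with a minimal Python sketch. It is not the paper's implementation: bitmaps are stored as uncompressed lists, positions are 0-based rather than 1-based, and the names `build_bitmaps` and `range_query` are hypothetical.

```python
def build_bitmaps(x, alphabet):
    """Return {a: bitmap}, where bitmap[i] == 1 iff x[i] == a."""
    return {a: [1 if c == a else 0 for c in x] for a in alphabet}


def range_query(bitmaps, alphabet, lo, hi):
    """Return the sorted positions i with lo <= x[i] <= hi.

    The answer is the bitwise OR of the bitmaps of all characters in the range.
    """
    n = len(next(iter(bitmaps.values())))
    result = [0] * n
    for a in alphabet:
        if lo <= a <= hi:
            result = [r | b for r, b in zip(result, bitmaps[a])]
    return [i for i, bit in enumerate(result) if bit]


x = "cabbadcab"
alphabet = sorted(set(x))
bitmaps = build_bitmaps(x, alphabet)
positions = range_query(bitmaps, alphabet, "b", "c")  # positions holding 'b' or 'c'
```

The query cost is proportional to the number of bitmaps in the range times $n$, which is only acceptable when the alphabet has constant size.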

If $\Sigma$ is large it is clear that some bitmaps will be very sparse, and hence storing
an explicit bitmap for each character will be inefficient in terms of space. This
problem can be addressed by compressing each bitmap to a representation that
uses close to the information-theoretic minimum space for representing vectors of
a given sparsity. A bit string of length $n$ with $m < n/2$ 1s requires space at least
$\lg\binom{n}{m} = m \lg(n/m) + O(m)$ bits.^1 One well-known optimal encoding (within a
constant factor) is the run-length encoding, where the length $r$ of each run of 0s
is encoded using a gamma code [12] using $2\lfloor \lg(r + 1) \rfloor + 2$ bits. Even though the
bitmaps are not independent, compressing them independently gives a total size of
the data structure that is within a constant factor of the size of the original string
$x$, i.e., $O(n \lg \sigma)$ bits. While gamma coding is asymptotically optimal, compression
schemes used in practice also take into account the computational effort needed to
compress and uncompress [18], with some reduction in worst-case compression rate.^2
In this paper we focus on the number of bits read and written, and hence use
run-length encoding with gamma codes (or more generally, any method that compresses
to within a constant factor of minimum size). More formally, we will analyze our
data structure in the I/O model [1] where the cost measure is the number of memory
blocks read and written. (Note that in this model the minimum amount of data
read is 1 block, i.e., we count block I/Os and not merely the amount of data read.)
While recent studies [18, 17] indicate that computation time can be a bottleneck
when handling (compressed) bitmap indexes, we find it likely that I/O is going to be
a future bottleneck, as the parallel processing power of CPUs is rapidly increasing.

^1 We use lg to denote the logarithm base 2.
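
To make run-length encoding with gamma codes concrete, here is a small Python sketch. It assumes the standard Elias gamma code applied to the run length plus one, which takes $2\lfloor \lg(r+1) \rfloor + 1$ bits per run and so stays within the bound quoted above. Codes are kept as strings of '0' and '1' for readability, and the function names are hypothetical.

```python
def gamma_encode(v):
    """Elias gamma code of an integer v >= 1, as a string of '0'/'1'."""
    if v < 1:
        raise ValueError("gamma codes are defined for v >= 1")
    binary = bin(v)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_bitmap(bits):
    """Gamma-code the length of the run of 0s preceding each 1.

    A run of length r is stored as gamma(r + 1) so that r = 0 is encodable.
    Trailing 0s after the last 1 are not stored; the length n is assumed known.
    """
    out, run = [], 0
    for b in bits:
        if b:
            out.append(gamma_encode(run + 1))
            run = 0
        else:
            run += 1
    return "".join(out)


def decode_bitmap(code, n):
    """Invert encode_bitmap for a bitmap of known length n."""
    bits, pos, i = [0] * n, 0, 0
    while i < len(code):
        zeros = 0
        while code[i] == "0":          # unary part: number of extra binary digits
            zeros += 1
            i += 1
        value = int(code[i:i + zeros + 1], 2)
        i += zeros + 1
        pos += value - 1               # skip the run of 0s
        bits[pos] = 1
        pos += 1
    return bits


bitmap = [0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
code = encode_bitmap(bitmap)
```

A sparse bitmap with $m$ 1s is thus stored in a number of bits that grows with the sum of the logarithms of the gaps, in line with the $m \lg(n/m) + O(m)$ bound.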

Performing a range query by reading the compressed bitmap for each character in
the range does not always give the best query time we could hope for. For example,
if each character occurs $n/\sigma$ times and we make a range query of size $\ell$ the output
is a set of size $n\ell/\sigma$, which can be represented in $O\left(\frac{n\ell}{\sigma} \lg(\sigma/\ell)\right)$ bits. However,
the total size of the individual bitmaps is $\Theta\left(\frac{n\ell}{\sigma} \lg \sigma\right)$, meaning that we are
reading a factor $\Omega(\lg(\sigma)/\lg(\sigma/\ell))$ more bits than the output size. When
$\ell = \Omega(\sigma)$ this is a factor $\Omega(\lg \sigma)$ from optimal.
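
The gap in this example can be checked numerically. The sketch below plugs concrete values into the leading terms $\frac{n\ell}{\sigma}\lg(\sigma/\ell)$ and $\frac{n\ell}{\sigma}\lg\sigma$; constant factors and lower-order terms are ignored, and the chosen values of $n$, $\sigma$ and $\ell$ are arbitrary.

```python
from math import log2


def bits_output_vs_read(n, sigma, ell):
    """Leading terms of the output size and of the per-character bitmaps read."""
    z = n * ell // sigma                               # size of the query result
    output_bits = z * log2(sigma / ell)                # z lg(n/z) with n/z = sigma/ell
    read_bits = ell * (n // sigma) * log2(sigma)       # ell bitmaps with n/sigma 1s each
    return output_bits, read_bits


out_bits, read_bits = bits_output_vs_read(n=2**20, sigma=2**10, ell=2**9)
ratio = read_bits / out_bits                           # equals lg(sigma) = 10 here
```

With the range covering half of the alphabet, the per-character approach reads $\lg \sigma$ times more bits than the compressed output needs.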

There have been a number of papers trying to use some kind of precomputation
to allow faster range queries. Some of these, such as range encoding [14] and interval
encoding [9, 10] use space $n\sigma^{1-o(1)}$ bits [16]. In this paper we are interested in
schemes where the precomputed data structure uses space close to the minimum
possible for representing $x$. A more space-conscious approach to range queries is
binning (see [16]). In its simplest form the idea is to divide $\Sigma$ into bins of $w$
characters and represent a compressed bitmap for each bin corresponding to all
occurrences of its characters. This means that a range query where the range has size
$\ell$ can be answered by combining less than $\lceil \ell/w \rceil + 2w$ compressed bitmaps. Using
this idea recursively one gets multi-resolution bitmap indexes [16]. Though not
analyzed in [16] the worst-case space usage of such an index, when each bitmap is
optimally compressed, is $\Theta(n \lg^2(\sigma) / \lg w)$ bits. Queries may in the worst case
require reading a factor $\Theta(\lg w)$ more data than the size of the output. This means
that there is a time-space trade-off, and one can never simultaneously achieve
optimal space for the data structure and optimal query time. In fact, a more general
scheme is discussed in [16] that allows the bucket size to be different on the various
resolution levels, but even this does not seem to yield any worst-case improvement.

^2 Some bitmap indexing schemes such as [18] claim space optimality, but this is in
comparison with indexes that represent all data explicitly, using at least $\log n$ bits
per key, which is not optimal for dense sets.

1.3 Our results



In the static setting, we obtain a data structure that simultaneously achieves two 
goals: 

• Space usage that is within a constant factor of the size of the string $x$. In
fact, the size of our data structure is within a constant factor of the 0th order
entropy of $x$, plus $O(n)$ bits. This is up to a factor $\Omega(\lg \sigma)$ less than the explicit
representation of $x$.

• Time usage for range queries that is within a constant factor of what would
be needed to read the result, had it been precomputed. This is up to a factor
$\Omega(\lg n)$ less than the time needed to read the explicit list of positions in the
result.

Our result improves previous results, all of which exhibit a time-space trade-off. 
We also show how to make our structure dynamic, which is something that has 
not been achieved by earlier data structures. Depending on the time allowed for 
updates, we achieve the same, or nearly the same, query time. Finally, we show how 
to support Bloom filter-like approximate queries with improved efficiency. 
The main conceptual and technical contributions of the paper are: 

• Formulation of the theoretical problem: Secondary indexing with worst-case 
optimal space and query time. This gives a unified view of secondary index 
performance, with B-trees and uncompressed bitmap indexes at the extremes. 

• A new multi-resolution bitmap indexing scheme that (ignoring constant
factors) does not exhibit a trade-off between space and query time (Section 2.2).

• A dynamization of the data structure (Section 4). A component of this, which 
is of independent interest, is a dynamic, buffered bitmap index (Section 4.2). 
The dynamization is mainly a technical contribution, requiring the use of 
several known ideas. 

• An I/O efficient way of supporting approximate range queries. The set
returned by such a query is rather large, but we show how it can be highly
compressed such that the representation is significantly smaller than that of
an exact result.

1.4 More notation 

We define the cardinality of a bitmap $S$ to be the number of 1s in it. Let $T$ denote
the size of the (optimally) compressed output in bits. It will be convenient for us
to measure the block size $B$ of the I/O model in bits, rather than words. Similarly,
$M$ will denote the size of internal memory in bits. We also use the parameter $b$
to denote the block size in "words", i.e., $b = \Theta(B/\lg n)$, where $n$ is the size of the
input. We assume that $B \ge \lg n$, and also that $b \ge 2$. Let $\Sigma = \{a_1, a_2, \ldots, a_\sigma\}$
with $a_1 < a_2 < \cdots < a_\sigma$.

2 Our secondary index



To simplify the description of the structure below, we assume $\sigma$ to be a power of 2.
The structure and its analysis can be easily modified to work for the general case.

2.1 A suboptimal solution. 

As a warm-up we first describe a simpler structure that introduces some of the ideas
used later, but only gives a suboptimal result. The data structure is a variant of
multi-resolution bitmap indexes [16], and is not space optimal, but it is a suitable
stepping stone for the optimal solution described in Section 2.2.

Consider the complete binary tree $U$ with $\sigma$ leaves identified from left to right
with the sequence $a_1, a_2, \ldots, a_\sigma$. With the leaf $a_i$ we associate the bitmap
$I_{\{a_i\}}(x)$, and with each internal node $v$, we associate the bitmap $I_{[a_l;a_r]}(x)$
where $a_l$ and $a_r$ are the leftmost and rightmost leaves below $v$, respectively.

Let $v_1, v_2, \ldots, v_{2^j}$ be the nodes, in left-to-right order, at level $j$ (the root
being at level 0), and let $n_i$ be the cardinality of the bitmap associated with the
node $v_i$. We store the compressed bitmaps of all the nodes at each level in their
left-to-right order. The space used by all the compressed bitmaps at the $j$th level is
$O\left(\sum_{i=1}^{2^j} \lg\binom{n}{n_i}\right)$. This summation is maximized when each of the $n_i$s is equal
to $n/2^j$, and in this case the space used by the $j$th level compressed bitmaps is
$O\left(\sum_{i=1}^{2^j} \lg\binom{n}{n/2^j}\right)$, which is $O(nj)$ bits. Hence the total space used by all the
levels is $\sum_{j=0}^{\lg \sigma} O(nj) = O(n \lg^2 \sigma)$ bits. We store the bitmaps of all the nodes in
their level order (from top to bottom, and from left to right in each level). For
each node, we also store the position and length of its compressed bitmap, which
takes $O(\sigma \lg n)$ bits overall.

We store an array $A$ of length $\sigma + 1$ where $A[i]$ stores the cardinality of the
bitmap $I_{[a_1;a_i]}(x)$, for $i = 1, \ldots, \sigma$, and $A[0] = 0$. The cardinality of
$I_{[a_l;a_r]}(x)$ is $z = A[r] - A[l - 1]$. If $z > n/2$, then instead of computing the
answer to the original query $I_{[a_l;a_r]}(x)$, we compute the answers to the two queries
$I_{[a_1;a_{l-1}]}(x)$ and $I_{[a_{r+1};a_\sigma]}(x)$, and return their union (which is the complement
of the query result). We now show how to answer a query for $I_{[a_l;a_r]}(x)$ assuming
$z \le n/2$.

To answer an alphabet range query, $I_{[a_l;a_r]}(x)$, we first observe that any
consecutive range of leaves can be covered by the disjoint union of $O(\lg \sigma)$ subtrees
of $U$ (by taking the maximal subtrees for which all leaves are within the range; at
each level, there will be at most two maximal subtrees whose leaves are within the
range). We compute the compressed bitmap of their union by merging the bitmaps.
Assuming that the size $M$ of internal memory is at least $B \lg \sigma$ this can be done in
a single pass.
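
The covering of a leaf range by maximal subtrees, together with the prefix-count array $A$, can be sketched in Python as follows. This is an illustration under assumptions not made in the text: the tree is addressed with heap numbering (root 1, children $2v$ and $2v+1$, leaf $a_i$ at node $\sigma + i - 1$), the string is given as a list of ranks in $1..\sigma$, and all function names are hypothetical.

```python
def canonical_nodes(sigma, l, r):
    """Maximal subtrees of the complete binary tree covering leaves l..r (1-based).

    sigma must be a power of 2. At most two nodes are chosen per level.
    """
    lo, hi = sigma + l - 1, sigma + r - 1
    nodes = []
    while lo <= hi:
        if lo % 2 == 1:        # lo is a right child: its subtree is maximal
            nodes.append(lo)
            lo += 1
        if hi % 2 == 0:        # hi is a left child: its subtree is maximal
            nodes.append(hi)
            hi -= 1
        lo //= 2
        hi //= 2
    return sorted(nodes)


def leaf_range(sigma, v):
    """Return (first, last) leaf index (1-based) below heap node v."""
    lo = hi = v
    while lo < sigma:
        lo, hi = 2 * lo, 2 * hi + 1
    return lo - sigma + 1, hi - sigma + 1


def prefix_counts(ranks, sigma):
    """A[i] = number of positions holding one of a_1..a_i, with A[0] = 0."""
    A = [0] * (sigma + 1)
    for c in ranks:
        A[c] += 1
    for i in range(1, sigma + 1):
        A[i] += A[i - 1]
    return A


sigma = 8
ranks = [1, 3, 3, 2, 8, 5, 3, 1]          # the string x, written as ranks
A = prefix_counts(ranks, sigma)
l, r = 2, 7
z = A[r] - A[l - 1]                       # size of the query result
nodes = canonical_nodes(sigma, l, r)      # subtrees whose bitmaps are merged
use_complement = z > len(ranks) / 2       # if so, query [1, l-1] and [r+1, sigma] instead
```

In this example the range $[a_2; a_7]$ is covered by two leaves and two nodes one level up, and since $z = 5 > n/2$ the query would be answered through its complement.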

Since the cardinalities of the bitmaps associated with subtrees decrease by a
factor of 2 as we go down by one level in the tree, we can argue that the sum of the
sizes of these $O(\lg \sigma)$ subtrees (where there are at most two subtrees at each level)
is at most 4 times the size of a subtree that is present at the highest level. Also,
the length of the compressed bitmap associated with a subtree that is present at
the highest level is a lower bound on $T$, since $z \le n/2$. Thus the overall number of
I/Os is $O(\lg \sigma)$ plus an $O(T/B)$ term for reading the bitmaps.

The result of the warm-up case is the following theorem, which we improve later. 

Theorem 1 A string $x = x_1 x_2 \ldots x_n \in \Sigma^n$ over a finite alphabet $\Sigma$ of size $\sigma$ can be
stored using $O(n \lg^2 \sigma)$ bits^3 such that range queries can be answered in
$O(T/B + \lg \sigma)$ I/Os.

2.2 Optimal space structure. 

We now describe how to extend the above scheme to the case of non-uniform
distribution, while at the same time reducing the space to optimal (assuming
$\sigma \ll n$). For simplicity we assume that no character has more than $n/2$ occurrences.
If this is not the case we may expand the alphabet and substitute half of the
occurrences of the most common character with a new character, increasing the 0th
order entropy by $O(n)$ bits. A main tool is to use a "weight balanced tree" $W$ on the
multi-set of characters occurring in $x$, instead of the complete binary tree $U$ on the
alphabet (used in the case of uniform distribution). Each of the $n$ characters is
associated with its position in $x$; their ordering in the tree is determined primarily
by the order on $\Sigma$, secondarily by the ordering of positions.

Our starting point is a weight-balanced B-tree from [4] with constant maximum
degree. As in other B-tree variants, all leaves are at the same distance from the root.
The essential property we need is that in a weight-balanced B-tree with branching
parameter $c > 4$ (a constant) and leaf parameter 1 we can efficiently maintain that
the weight of a node $v$ (number of leaves in the subtree below $v$) at level $i$ from the
bottom is between $\frac{1}{2}c^i$ and $2c^i$. This implies that the maximum degree is $4c$ and
the depth is $O(\lg n)$. The weight of a node is stored with the node. Note that the
weight of a node at level $i$ from the top is $O(n/c^i)$ (this is the bound we will actually
use).

With each internal node $v$ of the tree, we associate a compressed representation
of a bitmap $S_v$ of length $n$ where $S_v[i] = 1$ if the character in position $i$ is in the
subtree rooted at $v$, and $S_v[i] = 0$ otherwise. We now prune this tree by removing
all the children of an internal node $v$ if all leaves below $v$ contain the same character.
In this pruned tree, each character appears at most $8c$ times at each level as a leaf
(if it appears more than $8c$ times as a leaf at a particular level, then since all these
leaves must be adjacent, a subset of them will be the only children of their parent
and hence should have been removed by the pruning procedure as their parent would
be associated with a single character). Thus the total number of leaves, and hence
the total number of nodes in the pruned tree is $O(\sigma \lg n)$, for constant $c$. We use
$W$ to denote this pruned tree. We define the weight of a node in the pruned tree to
be the cardinality of the bitmap associated with it (which is the same as its "weight"
in the tree before pruning).

^3 The space bound can be improved to $O(n \lg \sigma + \sigma \lg^2 n)$ bits while improving the
query time to $O(T/B + \lg \lg \sigma + \lg_b \sigma)$ I/Os by using the ideas described in
Section 2.2.

A naive upper bound. Since the sum of the cardinalities of all bitmaps stored
with nodes at any particular level is at most $n$ and since each bitmap of a node
at level $i$ has cardinality $O(n/c^i)$, the compressed bitmap of a node at level $i$ takes
$O(ni/c^i)$ bits (recall that $c$ is constant). Since there are $O(c^i)$ nodes at level $i$ the
total space at this level is $O(ni)$ bits, and because the height of the tree is $O(\lg n)$,
the overall space used by all the levels is $O(n \lg^2 n)$ bits. The tree structure
(including pointers to the bitmaps at each node, but not including the bitmaps
themselves) is laid out on the disk such that any root-to-leaf path can be traversed
using $O(\lg_b n)$ I/Os. More specifically, starting from the root, we store the top
$d = \Theta(\lg b)$ levels in a block with pointers to each of the subtrees at level $d + 1$.
Each of the subtrees at level $d + 1$ is recursively stored in the same fashion. We
merge the blocks so that no block is more than half empty. Thus the total space is
within a constant factor of the space needed to store the tree without "blocking".
Since we can traverse $\Theta(\lg b)$ levels in any root-to-leaf path using one I/O, any
root-to-leaf path can be traversed using $O(\lg_b n)$ I/Os. The compressed bitmaps of
all nodes are stored in level order.

A range query $I_{[a_l;a_r]}(x)$ can be partitioned into $O(\lg n)$ subtrees where there
are only a constant number of subtrees at each level. These subtrees can be identified
by traversing the tree top down to find the leftmost and rightmost leaves associated
with the characters $a_l$ and $a_r$ respectively, using $O(\lg_b n)$ I/Os. More specifically,
starting from the root, at each level we go to the leftmost (rightmost) child of the
current node which is associated with $a_l$ ($a_r$), while including subtrees rooted at all
the right (left) siblings of the child to the list of subtrees to be merged. If $z$ is the
cardinality of the query result, then there are $O(1)$ subtrees of weight between $z$
and $z/c$, $O(1)$ subtrees of size between $z/c$ and $z/c^2$, $\ldots$, $O(1)$ subtrees of size 1.
Thus the total size of all the compressed bitmaps we need to read (to compute their
union) is at most $O(1) \cdot (z \lg(n/z) + (z/c) \lg(cn/z) + \cdots + \lg n) = O(z \lg(n/z))$,
which is asymptotically optimal. For each compressed bitmap read we may use up
to 2 I/Os to read blocks that do not contain $B$ bits of the compressed bitmap. That
is, we waste $O(\lg n)$ I/Os compared to the smallest possible number of I/Os required
to read the compressed bitmaps.

Space improvement. We now show how to improve the space of the above
scheme to $O(nH_0)$ bits, where $H_0$ is the 0th order entropy of the string $x$, while
retaining the query time. The main idea is to store bitmaps only on a few levels
of the tree explicitly instead of storing all of them. More specifically, if $h$ is the
height of the tree $W$, we store the $O(\lg h)$ levels numbered $1, 2, 4, 8, \ldots$ (from the
top), and also store all the leaves explicitly. We refer to the levels that are stored
explicitly (including the leaf level) as the materialized levels. In addition, we store
the structure of the entire tree $W$, without removing any levels.

We now analyze the space usage. Consider an instance of a character that is
stored at level $i$ in the tree. It contributes to the space usage of one bitmap index at
each of levels $1, 2, 4, \ldots, 2^{\lfloor \lg i \rfloor}$, which means a total of $O(i)$ bits. In other words,
the total space is within a constant factor of the space needed for the bitmap indexes
stored at the leaves.

Now consider the leaves corresponding to a single character $a \in \Sigma$ with $z_a$
occurrences. As noted above, at any level there are at most $8c$ leaves associated
with $a$. Further, the heaviest leaf associated with $a$ has weight $\Omega(z_a/c)$ (because
a constant fraction of the total weight must come from the uppermost level). As
argued earlier, the exponential decrease in weight down the tree means that the
space usage is dominated by the uppermost level, which uses space $O(z_a \lg(n/z_a))$.
Summing over all characters we get $\sum_{a \in \Sigma} O(z_a \lg(n/z_a)) = O(nH_0)$ bits, where
$H_0$ is the 0th order entropy of the string $x$. In addition, the tree has $O(\sigma \lg n)$
nodes, each of which stores a pointer to the bitmap corresponding to that node. Since
each pointer can be represented using $O(\lg n)$ bits, the total space used by the tree
structure is $O(\sigma \lg^2 n)$ bits.
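
The quantity $nH_0 = \sum_{a \in \Sigma} z_a \lg(n/z_a)$ that appears in the space bound is easy to compute for a concrete string. The following Python sketch does so; the function name `n_times_h0` is hypothetical.

```python
from collections import Counter
from math import log2


def n_times_h0(x):
    """Return n * H_0(x), i.e. the sum over characters a of z_a * lg(n / z_a)."""
    n = len(x)
    return sum(z_a * log2(n / z_a) for z_a in Counter(x).values())


uniform = n_times_h0("abcd" * 4)      # every character equally frequent: n lg(sigma) = 32
skewed = n_times_h0("a" * 15 + "b")   # highly skewed: far below n lg(sigma) = 16
```

For skewed distributions $nH_0$ can be much smaller than $n \lg \sigma$, which is where the entropy-based bound improves on the worst-case one.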

To answer a query we use the same algorithm as in the "naive upper bound",
except that when we need a bitmap index that is not explicitly stored, it is computed
by merging the bitmaps stored with all the nearest descendants that are in the
materialized level immediately below. The weight of a node is used to figure out
the total cardinality of the bitmaps to be read from the next materialized level. We
store the bitmaps of all the internal nodes at each materialized level by concatenating
them in their left-to-right order. The set of all bitmaps to be read at a level forms two
consecutive chunks in the concatenated sequence of bitmaps. So at each materialized
level, and at the leaf level, $O(1)$ I/Os are wasted reading the data that is not needed
to form the answer. Assuming that $M = B(\sigma \lg n)^{\Omega(1)}$ we can merge these
$O(\sigma \lg n)$ bitmaps in $O(1)$ passes (the weights of the nodes can be used to find the
starting positions of the compressed bitmaps), meaning that the number of I/Os is
within a constant factor of the I/Os needed to read the individual bitmaps.

We now analyze the query time. Searching the tree $W$ to find all the relevant
subtrees to be merged requires $O(\lg_b n)$ I/Os. From these subtrees, one can compute
the starting positions and the lengths of all the bitmaps to be read from all the
materialized levels (without any additional I/Os). The space needed to represent
the compressed bitmap of a node at level $i$ in the tree is bounded by a constant factor
(two) times the space used by its lowest ancestor that is stored explicitly (i.e., the
ancestor of the node at level $2^{\lfloor \lg i \rfloor}$). This means that the number of bits we need to
read is within a constant factor of the case where all bitmaps are explicitly stored,
which is the case analyzed in the "naive upper bound" above. In addition, we waste
$O(1)$ I/Os in reading the relevant bitmaps at each materialized level, and hence
overall $O(\lg \lg n)$ I/Os are wasted.
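
The level arithmetic used in the space and query analyses can be made explicit with a short Python sketch. It only models level numbers (counted from the top, starting at 1), not the tree itself, and the function names are hypothetical.

```python
def materialized_levels(h):
    """Levels 1, 2, 4, 8, ... (from the top) that do not exceed the height h."""
    levels, j = [], 1
    while j <= h:
        levels.append(j)
        j *= 2
    return levels


def lowest_materialized_ancestor_level(i):
    """Largest power of two not exceeding i, i.e. 2**floor(lg i), for i >= 1."""
    return 1 << (i.bit_length() - 1)


levels = materialized_levels(20)                      # the O(lg h) stored levels
ancestor = lowest_materialized_ancestor_level(13)     # level whose bitmap stands in for level 13
```

Since the stored levels are powers of two, the levels not exceeding $i$ sum to less than $2i$, which matches the $O(i)$ bits charged to a character instance at level $i$; likewise $i$ is less than twice the level of its lowest materialized ancestor, matching the factor two in the query analysis.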

Theorem 2 A string $x = x_1 x_2 \ldots x_n \in \Sigma^n$ over a finite alphabet $\Sigma$ of size $\sigma$ can be
stored using $O(nH_0 + n + \sigma \lg^2 n)$ bits, where $H_0$ is the 0th order entropy of $x$, such
that range queries can be answered in $O(z \lg(n/z)/B + \lg_b n + \lg \lg n)$ I/Os, where $z$
is the cardinality of the answer to the query, assuming $M = B(\sigma \lg n)^{\Omega(1)}$.

3 Approximate queries



We now consider how to reduce the query time by allowing queries to return false
positives, in the spirit of Bloom filters [6, 8]. More precisely, a query reports a set
$I'_{[a_l;a_r]} \supseteq I_{[a_l;a_r]}$ where for each $i \notin I_{[a_l;a_r]}$ the probability that
$i \in I'_{[a_l;a_r]}$ is at most $\varepsilon$. The parameter $\varepsilon$ is supplied as an argument to the query
algorithm. Approximate secondary indexing was recently considered by Apaydin et
al. [2], but their query algorithm is optimized for a RAM model, meaning that it has
many random accesses and thus poor performance in the I/O model.

We use the technique of Carter et al. [8], as further developed by Bille et al. [5],
for converting the problem of storing a set with $\varepsilon$ false positive rate to the problem
of storing exactly a set within a smaller universe. Specifically, whenever the
exact data structure described in Section 2.2 stores a set of positions $S \subseteq [n]$, the
approximate data structure additionally stores a sequence of $k = \lfloor \lg \lg n \rfloor$ hashed
sets $h_1(S), \ldots, h_k(S)$, where $h_j : [n] \to [2^{2^j}]$ is a function chosen at random from a
universal family. The same $k$ functions are used in each node, and we group the sets
according to what hash function was used. At a node where a set $S$ is stored, the
hashed sets occupy $O\left(\sum_{j=1}^{k} \lg\binom{2^{2^j}}{|S|}\right) = O\left(\lg\binom{n}{|S|}\right)$ bits. This means that the
total space needed to store the hashed sets, as compressed bitmaps, is dominated by
the space needed to store $S$.

When processing a query for the interval $[a_l; a_r]$ the first step is to compute the
size $z$ of the result $I_{[a_l;a_r]}$. This can be done efficiently using the weight-balanced
B-tree. Then we choose $j$ as the smallest integer such that $2^{2^j} > z/\varepsilon$. If $j > k$
we cannot save anything by returning an approximate result, so we answer the
query exactly as described in Section 2.2. Otherwise we compute $h_j(I_{[a_l;a_r]})$ by
taking the union of the $j$th hashed sets (rather than the union of the position sets
themselves). By the analysis in Section 2.2 the number of bits read is
$O\left(\lg\binom{2^{2^j}}{z}\right)$, which is $O(z \lg(1/\varepsilon))$ by our choice of $j$.
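
The choice of the hash level $j$ can be sketched in Python as below. The function name `choose_level` is hypothetical, the search starts at $j = 1$ because the hashed sets are $h_1(S), \ldots, h_k(S)$, and returning `None` stands for "fall back to the exact query".

```python
from math import floor, log2


def choose_level(n, z, eps):
    """Smallest j >= 1 with 2**(2**j) > z / eps, or None if j exceeds k = floor(lg lg n)."""
    k = floor(log2(log2(n)))
    j = 1
    while 2 ** (2 ** j) <= z / eps:
        j += 1
    return j if j <= k else None


n = 2 ** 32                                        # k = 5 hashed sets per node
small = choose_level(n, z=100, eps=0.01)           # 2**16 > 10**4, so level 4 suffices
large = choose_level(n, z=10 ** 8, eps=0.01)       # even 2**32 is too small: answer exactly
```

The hashed universe $2^{2^j}$ is thus tied to $z/\varepsilon$ rather than to $n$, which is what makes the hashed result cheaper to read than the exact one.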

Finally, we let $I'_{[a_l;a_r]}$ be the preimage $h_j^{-1}(h_j(I_{[a_l;a_r]}))$ of the hashed result.
For many common universal families the preimage can be computed with a small
effort. Note that we do not want to output the preimage (it is quite large), but only
to generate it without using any further I/Os. In the context of high-dimensional
range queries it is also important that we can efficiently compute the intersection of
several approximate query results, but this is easy: Simply compute the preimage of
the intersection.

We describe a well-known and particularly attractive universal family. Split a
number $i \in [n]$ into two parts $(i_1, i_2)$ where $i_2$ is the $2^j$ least significant bits of $i$
and $i_1$ is the $\lceil \lg(n + 1) \rceil - 2^j$ most significant bits of $i$. Then take any universal
family $\mathcal{H}$ mapping to $[2^{2^j}]$, pick $g_j \in \mathcal{H}$ uniformly at random and let
$h_j(i_1, i_2) = g_j(i_1) \oplus i_2$ where $\oplus$ denotes bitwise exclusive or. The family from which
$h_j$ is chosen can easily be seen to be universal. Then the set of indices $(i_1, i_2)$ that
map to $s \in \{0, 1\}^{2^j}$ is
$h_j^{-1}(s) = \{(i_1, s \oplus g_j(i_1)) \mid i_1 = 0, 1, 2, \ldots\}$.
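
A minimal Python sketch of this split-and-XOR construction follows. Everything beyond the formula $h_j(i_1, i_2) = g_j(i_1) \oplus i_2$ is an assumption made for illustration: $g_j$ is drawn from the multiply-add family modulo the prime $2^{61} - 1$ and then reduced to the low bits, positions are 0-based in $[0, n)$, and the class name `SplitXorHash` is hypothetical. In the text's notation the parameter `low_bits` would be $2^j$.

```python
import random


class SplitXorHash:
    """h(i1, i2) = g(i1) XOR i2, where i2 holds the low_bits least significant bits of i."""

    def __init__(self, low_bits, rng):
        self.low_bits = low_bits
        self.mask = (1 << low_bits) - 1
        # Assumed universal family for g: ((a * i1 + b) mod p) mod 2**low_bits.
        self.p = (1 << 61) - 1
        self.a = rng.randrange(1, self.p)
        self.b = rng.randrange(0, self.p)

    def g(self, i1):
        return ((self.a * i1 + self.b) % self.p) & self.mask

    def __call__(self, i):
        i1, i2 = i >> self.low_bits, i & self.mask
        return self.g(i1) ^ i2

    def preimage(self, s, n):
        """All i in [0, n) with h(i) == s: exactly one candidate per value of i1."""
        result = []
        i1 = 0
        while (i1 << self.low_bits) < n:
            i = (i1 << self.low_bits) | (s ^ self.g(i1))
            if i < n:
                result.append(i)
            i1 += 1
        return result


rng = random.Random(0)
h = SplitXorHash(low_bits=4, rng=rng)
n = 1000
exact = {3, 77, 500, 999}                               # an exact query result
hashed = {h(i) for i in exact}                          # what the index would store
approx = sorted(i for s in hashed for i in h.preimage(s, n))
```

Generating the preimage needs only the stored hash values and the description of $g_j$, so it costs computation but no further I/Os.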

To argue that the desired error bound is met, consider $i \notin I_{[a_l;a_r]}$. By
universality, the probability that $h_j(i) \in h_j(I_{[a_l;a_r]})$ is at most
$z/2^{2^j} < \varepsilon$. Since $i \in I'_{[a_l;a_r]}$ if and only if $h_j(i) \in h_j(I_{[a_l;a_r]})$ this completes
the argument. We get the following approximate variant of Theorem 2.

Theorem 3 A string $x = x_1 x_2 \ldots x_n \in \Sigma^n$ over a finite alphabet $\Sigma$ of size $\sigma$ can
be stored using $O(nH_0 + n + \sigma \lg^2 n)$ bits, where $H_0$ is the 0th order entropy of $x$,
such that approximate range queries with false positive probability $\varepsilon$ can be answered
in $O(z \lg(1/\varepsilon)/B + \lg_b n + \lg \lg n)$ I/Os, where $z$ is the cardinality of the answer
to the query, assuming $M = B(\sigma \lg n)^{\Omega(1)}$. The I/O bound captures the time for
generating the result, but not for outputting it.

As shown by Carter et al. [8] the space needed to represent a set of size $z$
approximately, with false positive probability $\varepsilon$, is $\Theta(z \lg(1/\varepsilon))$ bits, so the query
time is optimal whenever $z$ is not small (e.g., when $z > \lg n$).

4 Supporting updates 

Given a string $x$ over the alphabet $\Sigma$, we consider the following update operations,
for some $a \in \Sigma$:

• append($x$, $a$): append the character $a$ at the end of $x$, and

• change($x$, $i$, $a$): change the $i$th character of $x$ to $a$.

We note that deletions are indirectly supported through these operations: Extend
the alphabet with a new character $\infty$ that is never matched by a range query.
Deleting a character can be done by simply changing it to $\infty$. If deletion markers
are similarly used in the table being indexed, this yields the desired semantics: The
positions of characters do not change when deletions are performed. It is, however,
simple to extend this to the more natural semantics where character positions are always
relative to the current string: Maintain a B-tree over the deleted positions with
subtree sizes maintained in all nodes. This allows translating positions back and
forth between the two systems using $O(\log_b n)$ I/Os, and space $O(n)$ bits (positions
in leaf nodes should be efficiently encoded, e.g., using gamma-coded differences). If
the number of deleted characters exceeds a constant fraction of all characters, global
rebuilding is performed to reduce the space.
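The translation between the two position systems can be illustrated with a minimal in-memory sketch. It replaces the B-tree with subtree sizes by a sorted Python list searched with `bisect`, so it shows only the arithmetic, not the I/O behaviour. The class name `DeletedPositions` and the 0-based indexing are assumptions made for the example.

```python
import bisect

class DeletedPositions:
    """Sorted list of deleted physical positions (0-based).

    Illustrative stand-in for a B-tree over deleted positions with
    subtree sizes; rank queries here use binary search instead.
    """

    def __init__(self):
        self.deleted = []

    def delete(self, phys):
        i = bisect.bisect_left(self.deleted, phys)
        if i == len(self.deleted) or self.deleted[i] != phys:
            self.deleted.insert(i, phys)

    def to_logical(self, phys):
        """Index of a live physical position within the current string."""
        return phys - bisect.bisect_left(self.deleted, phys)

    def to_physical(self, logical):
        """Live physical position holding the logical-th current character."""
        lo, hi = logical, logical + len(self.deleted)
        while lo < hi:
            mid = (lo + hi) // 2
            # Number of live positions in [0, mid].
            live = mid + 1 - bisect.bisect_right(self.deleted, mid)
            if live >= logical + 1:
                hi = mid
            else:
                lo = mid + 1
        return lo

d = DeletedPositions()
d.delete(1)
d.delete(3)
# Physical positions 0, 2, 4, 5 are now logical positions 0, 1, 2, 3.
assert d.to_logical(4) == 2
assert d.to_physical(1) == 2
```

Both directions reduce to counting deleted positions below a given point, which is the rank query that subtree sizes make available in a B-tree.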

4.1 Semi-dynamic version 

We first consider only updates that append characters to the end of the string. This 
is motivated by the fact that OLAP and scientific data, for which bitmap indexes 
have been shown to be very effective, are typically read and append only [16]. We
describe two ways to modify the weight-balanced B-tree structure used in Section 2.2
to support append, achieving different time-space trade-offs. 

A straightforward way of supporting updates to the structure described in Section 2.2
is to perform the update on all the bitmaps that are affected by it. Since one
bitmap in each materialized level (namely the one corresponding to the last occurrence
of that character) will be affected by an update, we need to update $O(\lg\lg n)$
bitmaps. This can be done efficiently by maintaining an array of size $O(\lg\lg n)$
for each character $\alpha \in \Sigma$. The $i$th entry in the array corresponding to $\alpha$ stores a
pointer to the disk block containing the last occurrence of $\alpha$ among all bitmaps at
the $i$th materialized level. We also maintain back pointers so as to update these
arrays efficiently whenever the blocks pointed to by the array are reorganized. This
requires an additional $O(\sigma \lg\lg n)$ pointers, using $O(\sigma \lg n \lg\lg n)$ bits.

After performing an update, if the weight-balancing condition at a node is violated,
then the subtree rooted at the parent of that node is rebuilt to maintain
the weight-balancing condition. More formally, let $v$ be the highest-level node at
which the weight-balancing condition is violated after an update, say at level $i$ from
the top, and let $u$ be the parent of $v$. To maintain the weight-balancing
condition, we rebuild the subtree rooted at $u$, and recompute the
bitmaps associated with all the nodes in the subtree. This can be done bottom-up
in level order. The bitmaps associated with leaves need not be recomputed. The
bitmaps associated with all the nodes at a materialized level can be computed by merging
the bitmaps of all the descendants at the materialized level immediately below.
Computing the compressed bitmap of a node by merging the compressed bitmaps
of all its descendants (in the materialized level immediately below) requires $O(1)$
passes assuming $M = B\sigma^{\Omega(1)}$, where each pass scans all the bitmaps that need to be
merged exactly once. Also, the size of all the bitmaps at any level within a given
subtree is at most the size of all the bitmaps at the leaves within the subtree. Thus
overall, the number of I/Os needed to compute the bitmaps of all the nodes in the
subtree is at most $\lg\lg n$ (the number of levels) times the number of I/Os needed to
scan the bitmaps of all the leaves in the subtree. As the weight of $u$ is $O(n/c^{i-1})$,
the size of the bitmaps associated with all the leaves below $u$ is $O((n/c^{i-1}) \lg(c^{i-1}))$
bits. Hence the cost of scanning the bitmaps associated with all the leaves in the
subtree rooted at $u$ is

$(1/B)(\lg\lg n - i) \, O((n/c^{i-1}) \lg(c^{i-1}))$ I/Os.

The total cost of rebalancing $u$ can be charged to the $\Theta(n/c^i)$ updates performed in
the subtree rooted at $v$ (which caused the violation of the weight-balancing condition
at $v$), making the amortized cost of rebalancing $O(1/b)$ I/Os.

Thus this structure extends the structure of Theorem 2 by supporting updates
in $O(\lg\lg n)$ amortized I/Os (while retaining the space and query bounds).

Theorem 4 A string $x = x_1 x_2 \cdots x_n \in \Sigma^n$ over a finite alphabet $\Sigma$ of size $\sigma$ can
be stored using $O(nH_0 + \sigma \lg^2 n)$ bits to support range queries in $O(z \lg(n/z)/B +
\lg\lg n + \lg_b n)$ I/Os, and append in amortized $O(\lg\lg n)$ I/Os.

4.1.1 Trading off space for faster updates. 

To get faster updates, the main idea is to use buffers with the internal nodes to
store the updates, similar to buffer trees [3] and buffered B-trees [7, 13], instead of
performing them right away. With each internal node of the tree $W$, we associate a
buffer of size $B$ bits. (Note that we also associate buffers with nodes that have no
explicitly stored bitmap.) The total space used by all the buffers is $O(B\sigma \lg n)$ bits,
as $W$ contains $O(\sigma \lg n)$ nodes, each of which is associated with a buffer of $B$ bits.

We now describe how to support append. To perform append($x$, $\alpha$), we first
insert the instruction into the buffer of the root, which is always kept in the internal
memory. When the buffer at a node $u$ becomes full, we find a child $v$ of $u$ on which
at least a (fixed) constant fraction of these updates have to be performed. Since the
degree of each node in $W$ is bounded by $4c$ (a constant), such a node always exists. If
node $u$ is stored explicitly, then we perform these updates on the bitmap associated
with $u$. We then delete those updates from the buffer at $u$ and insert them into
the buffer at node $v$. Since the buffer size is $B$ bits, $\Theta(B/\lg n) = \Theta(b)$ updates
are moved from $u$ to $v$. Thus the amortized number of I/Os required to perform
an update on all the nodes on a root-to-leaf path is $O(\lg n / b)$. The amortized cost of
rebalancing the weight-balanced tree can be shown to be $O(1/b)$ I/Os as mentioned
before (the amortized cost of reading all the buffers in the subtree being rebalanced
is negligible).
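The buffering scheme for append can be sketched as follows. This is a simplified in-memory model under several assumptions: every node keeps a sorted position list as a stand-in for its bitmap (the structure itself only materializes some levels), the buffer capacity is a small update count rather than $B$ bits, leaves cover a single character, and rebalancing is omitted. The names `Node`, `receive`, `flush`, and `positions_of` are hypothetical.

```python
import bisect

class Node:
    def __init__(self, lo, hi, children=(), capacity=4):
        self.lo, self.hi = lo, hi        # character range [lo, hi] covered
        self.children = list(children)   # empty for a leaf
        self.positions = []              # stand-in for the node's bitmap
        self.buffer = []                 # pending (position, char) updates
        self.capacity = capacity         # stand-in for a buffer of B bits

def child_for(node, ch):
    return next(c for c in node.children if c.lo <= ch <= c.hi)

def receive(node, updates):
    """Hand a batch of updates to `node`: leaves apply, internal nodes buffer."""
    if not node.children:
        for pos, _ in updates:
            bisect.insort(node.positions, pos)
        return
    node.buffer.extend(updates)
    while len(node.buffer) >= node.capacity:
        flush(node)

def flush(node):
    """Move the updates destined for the busiest child one level down."""
    counts = {}
    for _, ch in node.buffer:
        key = id(child_for(node, ch))
        counts[key] = counts.get(key, 0) + 1
    target = max(node.children, key=lambda c: counts.get(id(c), 0))
    moved = [u for u in node.buffer if child_for(node, u[1]) is target]
    node.buffer = [u for u in node.buffer if child_for(node, u[1]) is not target]
    for pos, _ in moved:                 # apply to this node's own bitmap
        bisect.insort(node.positions, pos)
    receive(target, moved)

def positions_of(root, ch):
    """Point lookup: leaf bitmap plus matching updates buffered on the path."""
    node, found = root, []
    while node.children:
        found += [p for p, c in node.buffer if c == ch]
        node = child_for(node, ch)
    return sorted(found + node.positions)

leaves = [Node(c, c) for c in "abcd"]
root = Node("a", "d", [Node("a", "b", leaves[:2]), Node("c", "d", leaves[2:])])
text = "abacabadabacabad"
for i, ch in enumerate(text):
    receive(root, [(i, ch)])
assert positions_of(root, "a") == [i for i, c in enumerate(text) if c == "a"]
```

In this model a node's position list always reflects every update that has left its buffer, which mirrors the invariant the query algorithm relies on: buffers below a node whose bitmap is read need not be inspected.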

To answer a query, apart from performing the query algorithm described in 
Section 2.2, we also need to read each of the buffers associated with all nodes that 
could potentially contain updates that are part of the answer to the query. Also, 
whenever we read the bitmap stored at a node, we do not need to look at the buffers 
associated with any of its descendants, as all the updates stored in the buffers at 
the descendants of a node have already been performed on the bitmap associated 
with that node. Hence the number of buffers we need to read to answer a query is 
only $O(\lg n)$. Thus we have

Theorem 5 A string $x = x_1 x_2 \cdots x_n \in \Sigma^n$ over a finite alphabet $\Sigma$ of size $\sigma$
can be stored using $O(nH_0 + \sigma \lg n (B + \lg n))$ bits to support range queries in
$O(z \lg(n/z)/B + \lg n)$ I/Os, and append in amortized $O(\lg n / b)$ I/Os.

4.2 Buffered compressed bitmap index 

Let $x = x_1 x_2 \ldots x_n$ be a string over a finite alphabet $\Sigma$. Given $\alpha \in \Sigma$, a point
query returns (a compressed representation of) the set $I_\alpha(x) = \{i \mid x_i = \alpha\}$. In this
section, we first develop a structure that dynamizes the standard bitmap index while
supporting point queries efficiently. We then use this structure (in Section 4.3) to
obtain a fully dynamic structure supporting alphabet range queries efficiently. The
structure described below supports point queries in $O(\lg n + T/B)$ I/Os, where $T$ is
the total size of the output, and updates in $O(\lg n / b)$ I/Os.

The main idea is to store the compressed bitmaps of all the characters in a buffer
tree. First, each bitmap is represented in compressed form as a list of positions of
1s, in increasing order in the bitmap, and this list is stored in a sequence of blocks.
The first position in each block is stored as an absolute value, and all the others are
stored relative to the previous position (i.e., the gaps are encoded) using gamma
codes. We concatenate the representations of all bitmaps together and store them
as a sequence of blocks.
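The gap encoding described above can be illustrated with Elias gamma codes. The sketch below handles a single block and represents bits as a string of '0'/'1' characters for readability. It assumes positions are 1-based, so that the first gap (measured from 0) equals the absolute first position. The function names are illustrative.

```python
def gamma_encode(n):
    """Elias gamma code of an integer n >= 1 as a '0'/'1' string."""
    binary = bin(n)[2:]
    # len(binary) - 1 zeros announce the length, then the binary digits.
    return "0" * (len(binary) - 1) + binary

def encode_positions(positions):
    """Gamma-code the gaps of a strictly increasing list of positions >= 1."""
    bits, prev = [], 0
    for p in positions:
        bits.append(gamma_encode(p - prev))
        prev = p
    return "".join(bits)

def decode_positions(bits):
    """Inverse of encode_positions."""
    positions, prev, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap
        positions.append(prev)
    return positions

ones = [3, 4, 9, 17]                 # positions of 1s in a bitmap
code = encode_positions(ones)        # gaps 3, 1, 5, 8
assert decode_positions(code) == ones
```

A gap of $g$ costs $2\lfloor \lg g \rfloor + 1$ bits, so dense bitmaps (small gaps) compress to a few bits per 1.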

Let $s = O(nH_0/B)$ be the number of blocks used by the compressed bitmaps of
all the characters. Assuming $B > 4 \lg n$, we can bound the number of blocks used
by the block-wise representation described above in terms of $s$. Store the original compressed
bitmaps of the given materialized level in blocks of $B/2$ bits each, and hence in
$2s$ blocks overall. By increasing the block size to $B/2 + 2 \lg n$, we can make the
first gamma code in each block an absolute value instead of being relative
to the previous position (using at most $\lg n$ additional bits), and the last code
completely contained within the block in case it is split between two adjacent
blocks (by using an additional $\lg n$ bits). Thus the new representation of the blocks
requires at most $2s$ blocks of size at most $B$ bits each.
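As a short worked check of the block-size arithmetic, the enlarged blocks fit into $B$ bits exactly when the assumption on $B$ holds, and halving the payload per block at most doubles the block count:

```latex
\frac{B}{2} + 2\lg n \le B
\;\iff\; 2\lg n \le \frac{B}{2}
\;\iff\; B \ge 4\lg n,
\qquad
\#\text{blocks} \;\le\; \frac{sB}{B/2} \;=\; 2s .
```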

With these blocks as the leaves, we construct a tree with branching factor c, 
for some fixed constant c > 2. With each internal node of this tree, we associate a 
buffer of size $B$ bits that stores a set of updates that are yet to be performed in one
of the leaves in the subtree rooted at that node. Each non-leaf block also stores an 
identifier for the first bitmap that is (partially) stored in the subtree, to allow fast 
navigation to a particular bitmap. 

An update is simply stored in the buffer corresponding to the root, which is 
always kept in the internal memory. Whenever a buffer becomes full, a constant 
fraction of the updates in that buffer are moved to one of its children. The amortized
cost of updates can be shown to be $O(\lg n / b)$ I/Os as before. The space usage of this
structure is of the same order as the total space used by all the individual compressed 
bitmaps, as the space usage is dominated by the leaf level bitmaps. 

A point query can be answered by following a root-to-leaf path in the tree fol- 
lowed by reading the compressed bitmap stored in the consecutive leaves starting 
at the end of the search path. In addition, we also need to merge this compressed 
bitmap with all the updates corresponding to the character, stored in the buffers. 
The number of buffers storing updates corresponding to a given character can be
shown to be $O(T/B + \lg n)$, where $T$ is the size of the compressed bitmap of the given
character. Assuming $M > B \lg n$, we can perform this merging in $O(1)$ passes. Thus
we have 

Theorem 6 A string $x = x_1 x_2 \ldots x_n \in \Sigma^n$ over a finite alphabet $\Sigma$ of size $\sigma$ can be
stored using $O(nH_0)$ bits to support point queries in $O(T/B + \lg n)$ I/Os, where $T$
is the size of the answer, and updates in amortized $O(\lg n / b)$ I/Os.
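The merging step of a point query can be sketched as below. The encoding of a buffered update as a pair `(op, pos)` with `op` in `{"set", "clear"}` is an assumption made for illustration (a change of position $i$ from one character to another would then clear the bit in the old character's bitmap and set it in the new one). The function name is hypothetical.

```python
def apply_buffered_updates(positions, updates):
    """Merge a character's leaf positions with its buffered updates.

    positions: sorted positions read from the leaves.
    updates:   (op, pos) pairs in arrival order, op is "set" or "clear".
    Returns the sorted positions after all updates are applied.
    """
    final = {}
    for op, pos in updates:          # a later update overrides an earlier one
        final[pos] = op
    kept = {p for p in positions if final.get(p) != "clear"}
    added = {p for p, op in final.items() if op == "set"}
    return sorted(kept | added)

leaf_positions = [2, 5, 9]
pending = [("set", 7), ("clear", 5), ("set", 5), ("clear", 9)]
assert apply_buffered_updates(leaf_positions, pending) == [2, 5, 7]
```

This in-memory version ignores I/O. In the external-memory setting the same result is obtained by a multiway merge of the sorted leaf blocks with the relevant buffers.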

4.3 Fully dynamic version 

We first observe that all the bitmaps stored at any particular materialized level of the 
optimal space structure described in Section 2.2 can be thought of as representing 
a bitmap index over an alphabet containing one character corresponding to each 
node in that level. Thus we can obtain a fully dynamic secondary bitmap index by 
representing each of the materialized levels as a buffered bitmap index. 
More formally, we represent all the bitmaps at each materialized level of the
structure described in Section 2.2 using the buffered bitmap index structure. The total
space usage is $O(nH_0 + \sigma \lg^2 n)$ bits (for representing all the bitmaps in all the
materialized levels, and for the tree structure). To perform an update we perform it on
each of the $O(\lg\lg n)$ materialized levels. Thus updates take amortized $O(\lg n \lg\lg n / b)$
I/Os. An alphabet range query can be decomposed into $O(1)$ point queries on each
materialized level, and can be answered using $O(\lg n \lg\lg n + z \lg(n/z)/B)$ I/Os,
where $z$ is the cardinality of the answer to the range query.
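As a hedged worked step, both bounds are products over the materialized levels. The calculation assumes that the buffered bitmap index has an amortized update cost of $O(\lg n / b)$ per level, with $b = \Theta(B/\lg n)$, and an additive $O(\lg n)$ term per point query:

```latex
\underbrace{O(\lg\lg n)}_{\text{materialized levels}} \cdot
\underbrace{O\!\left(\frac{\lg n}{b}\right)}_{\text{update I/Os per level}}
= O\!\left(\frac{\lg n \,\lg\lg n}{b}\right),
\qquad
\underbrace{O(\lg\lg n)}_{\text{levels}} \cdot
\underbrace{O(1)}_{\text{point queries per level}} \cdot
\underbrace{O(\lg n)}_{\text{search cost each}}
= O(\lg n \,\lg\lg n).
```

The second product accounts only for the additive search term of the query bound; the output-dependent term $z \lg(n/z)/B$ is taken from the text as stated.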

Theorem 7 A string $x = x_1 x_2 \cdots x_n \in \Sigma^n$ over a finite alphabet $\Sigma$ of size $\sigma$ can
be stored using $O(nH_0 + \sigma \lg^2 n)$ bits to support range queries in $O(z \lg(n/z)/B +
\lg n \lg\lg n)$ I/Os, and updates in amortized $O(\lg n \lg\lg n / b)$ I/Os.

One can also achieve other trade-offs between space and operation times by 
choosing to store all the levels of W explicitly and using buffers at the internal 
nodes. 